Segmentation of Chinese Long Sentences Using Commas

Authors

  • Meixun Jin
  • Mi-Young Kim
  • Dongil Kim
  • Jong-Hyeok Lee
Abstract

The comma is the most common form of punctuation. As such, it may have the greatest effect on the syntactic analysis of a sentence. Because Chinese is an isolating language, its sentences offer fewer cues for parsing, and the cues for segmenting a long Chinese sentence are fewer still. However, commas are used more frequently on average in Chinese than in other languages, so the comma plays an important role in segmenting long Chinese sentences. This paper proposes a method that classifies commas in Chinese sentences by their context and then segments a long sentence according to the classification results. Experimental results show that comma classification accuracy reaches 87.1 percent and that, with our segmentation model, our parser's dependency parsing accuracy improves by 9.6 percent.
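The two-step idea in the abstract (classify each comma, then split at the commas judged to be segment boundaries) can be illustrated with a minimal sketch. The abstract does not specify the features or the classifier, so everything here is an assumption for illustration: `segment_by_commas`, `toy_is_boundary`, and the length-based rule are hypothetical placeholders, not the authors' method.

```python
from typing import Callable, List

COMMA = "，"  # full-width comma used in Chinese text


def segment_by_commas(
    sentence: str, is_boundary: Callable[[str, str], bool]
) -> List[str]:
    """Split `sentence` after every comma that `is_boundary` accepts.

    `is_boundary(left, right)` receives the text since the last split
    (left) and the rest of the sentence (right). It stands in for a
    real context-based comma classifier, which is not shown here.
    """
    segments: List[str] = []
    start = 0
    for i, ch in enumerate(sentence):
        if ch != COMMA:
            continue
        left, right = sentence[start:i], sentence[i + 1:]
        if is_boundary(left, right):
            # Keep the comma attached to the segment it closes.
            segments.append(sentence[start:i + 1])
            start = i + 1
    if start < len(sentence):
        segments.append(sentence[start:])
    return segments


def toy_is_boundary(left: str, right: str, min_len: int = 4) -> bool:
    """Hypothetical placeholder rule: split only when both sides are long
    enough to plausibly be clauses. A real system would use a trained
    classifier over contextual features instead."""
    return len(left) >= min_len and len(right) >= min_len


if __name__ == "__main__":
    text = "我昨天去了北京，今天，我们回到了上海。"
    for segment in segment_by_commas(text, toy_is_boundary):
        print(segment)
```

Passing the classifier in as a function keeps the splitting logic independent of how commas are classified, so the placeholder rule can be swapped for a trained model without touching the segmentation loop.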


Similar resources

Chinese sentence segmentation as comma classification

We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detectin...


On Closed Task of Chinese Word Segmentation: An Improved CRF Model Coupled with Character Clustering and Automatically Generated Template Matching

This paper addresses two major problems in the closed task of Chinese word segmentation (CWS): tagging sentences interspersed with non-Chinese words, and long named entity (NE) identification. To resolve the former, we apply K-means clustering to identify non-Chinese characters, and then adopt a two-tagger architecture: one for Chinese text and the other for non-Chinese text. For the latter problem,...


Sentence Segmentation Using IBM Word Alignment Model 1

In statistical machine translation, word alignment models are trained on bilingual corpora. Long sentences pose severe problems: 1. the high computational requirements; 2. the poor quality of the resulting word alignment. We present a sentence-segmentation method that solves these problems by splitting long sentence pairs. Our approach uses the lexicon information to locate the optimal split po...


Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families

In the task of acquiring Japanese-Chinese technical term translation equivalent pairs from parallel patent documents, this paper considers situations where a technical term is observed in many parallel patent sentences and is translated into many translation equivalents and studies the issue of identifying synonymous translation equivalent pairs. First, we collect candidates of synonymous trans...


Two-Phase LMR-RC Tagging for Chinese Word Segmentation

In this paper we present a Two-Phase LMR-RC Tagging scheme to perform Chinese word segmentation. In the Regular Tagging phase, Chinese sentences are processed similarly to the original LMR Tagging. Tagged sentences are then passed to the Correctional Tagging phase, in which the sentences are re-tagged using extra information from the first round tagging results. Two training methods, Separated Mo...



Publication date: 2004